@wking wking commented Jun 18, 2025

Change management notification sent on the 3rd of July.

OCPBUGS-57348 is the 4.20 bug, and we don't have a 4.19.z clone yet, but we can update this entry once that 4.19.z OCPBUGS-... exists.

Version(s): 4.19

Issue: OCPBUGS-57348

Docs preview.

QE review:

@openshift-ci-robot openshift-ci-robot added jira/severity-moderate Referenced Jira bug's severity is moderate for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. labels Jun 18, 2025
@openshift-ci-robot

@wking: This pull request references Jira Issue OCPBUGS-57348, which is invalid:

  • expected Jira Issue OCPBUGS-57348 to depend on a bug in one of the following states: VERIFIED, RELEASE PENDING, CLOSED (ERRATA), CLOSED (CURRENT RELEASE), CLOSED (DONE), CLOSED (DONE-ERRATA), but no dependents were found

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

OCPBUGS-57348 is the 4.20 bug, and we don't have a 4.19.z clone yet, but we can update this entry once that 4.19.z OCPBUGS-... exists.

Version(s): 4.19

Issue: [OCPBUGS-57348](https://issues.redhat.com/browse/OCPBUGS-57348)

Link to docs preview:

QE review:

  • QE has approved this change.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot added the jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. label Jun 18, 2025
@openshift-ci openshift-ci bot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Jun 18, 2025
@wking
Member Author

wking commented Jun 18, 2025

@openshift/team-documentation, I'm not sure what the add-a-known-issue process is; can you help me get this ingested?

@ocpdocs-previewbot

ocpdocs-previewbot commented Jun 18, 2025

🤖 Sat Jun 28 09:22:17 - Prow CI generated the docs preview:

https://94987--ocpdocs-pr.netlify.app/openshift-enterprise/latest/release_notes/ocp-4-19-release-notes.html

@wking wking force-pushed the boot-image-clobber-known-issue branch from df43ab0 to 72d6586 Compare June 18, 2025 23:27

* In {product-title} {product-version}, clusters using IPsec for network encryption might experience intermittent loss of pod-to-pod connectivity. This prevents some pods on certain nodes from reaching services on other nodes, resulting in connection timeouts. Internal testing could not reproduce this issue on clusters with 120 nodes or less. There is no workaround for this issue. (link:https://issues.redhat.com/browse/OCPBUGS-55453[OCPBUGS-55453])

* {product-title} clusters that are installed on {aws-short} with custom AMIs or on {gcp-short} with custom disk images will have those customizations overridden by boot image management. To recover, you must xref:../machine_configuration/mco-update-boot-images.adoc#mco-update-boot-images-disable_machine-configs-configure[disable boot image management], restore your MachineSet boot images, and delete any Machines created with an undesired boot image.(link:https://issues.redhat.com/browse/OCPBUGS-57348[OCPBUGS-57348])
Contributor

@dfitzmau dfitzmau Jun 19, 2025

Suggested change
* {product-title} clusters that are installed on {aws-short} with custom AMIs or on {gcp-short} with custom disk images will have those customizations overridden by boot image management. To recover, you must xref:../machine_configuration/mco-update-boot-images.adoc#mco-update-boot-images-disable_machine-configs-configure[disable boot image management], restore your MachineSet boot images, and delete any Machines created with an undesired boot image.(link:https://issues.redhat.com/browse/OCPBUGS-57348[OCPBUGS-57348])
* If you install a cluster on {aws-short} that has Amazon Machine Images (AMI) enabled or on {gcp-short} that has custom disk images enabled, the boot image management overrides these customization images with boot images. As a workaround, you can disable the boot image management feature, restore the boot images for the machine sets to their original location, and delete any machines that were incorrectly generated by the overriding boot images. To disable boot image management, see ref:../machine_configuration/mco-update-boot-images.adoc#mco-update-boot-images-disable_machine-configs-configure[disable boot image management]. (link:https://issues.redhat.com/browse/OCPBUGS-57348[OCPBUGS-57348])

Member Author

I have some concerns with that suggestion, but in the interest of getting something declared as a known-issue, I've adopted your wording (and pivoted from the 4.20 OCPBUGS-57348 to the 4.19.z OCPBUGS-57796, now that that backport tracker exists) with 72d6586 -> cf256d9. See the cf256d9 commit message for more on my concerns. I'll leave it up to docs folks if any of them are worth follow-up pull requests.

/label merge-review-needed

@dfitzmau dfitzmau added the peer-review-done Signifies that the peer review team has reviewed this PR label Jun 19, 2025
@dfitzmau
Contributor

Thanks for raising this PR, @wking. I added a comment that revises the note.

@dfitzmau
Contributor

Once you have incorporated the suggestion, either exactly as is or in some variant, you can add a /label merge-review-needed comment to this PR to place it in the merge queue.

wking added 2 commits June 24, 2025 15:40
…obber rewording

Pulling in the suggestion from [1], word for word (although I am
bumping the bug link to OCPBUGS-57796, now that that 4.19.z backport
tracker exists), to try to get something declared.  We can wordsmith
later if we want.  Personally, I have concerns, including:

> ... the boot image management overrides these customization images
> with boot images.

but "with boot images" doesn't make sense to me, because MachineSets
are going to reference boot images regardless; there's no "without
boot images" option.  The distinction is that sometimes the boot
images are specifically selected by the cluster admin, and the MCO's
boot image management would override those cluster-admin preferences.

> ...restore the boot images for the machine sets to their original
> location...

I'd prefer "previous value" or something to "original location".  I
haven't heard folks say "location" for an AMI ID or other MachineSet
property value, while I have heard "value" for that.  And it seems
like folks might misread "original" as "what the MachineSet used when
it was created" when what we mean is "what the MachineSet used just
before the MCO clobbered its boot image value".

> ...delete any machines that were incorrectly generated by the
> overriding boot images.

I'd prefer "overridden boot images" or my "undesired boot images",
because the timeline there is:

1. MCO overrides the MachineSet's boot image configuration, inserting
   a stock boot image ID instead of the admin-preferred boot image ID.
2. Admin updates MachineConfiguration to disable MCO boot image
   management for that MachineSet.  After this, all overriding will be
   past-tense.  Cluster-admin preference for the admin-preferred boot
   image over the stock boot image can continue in the present-tense.
3. Admin restores MachineSet boot image ID configuration.
4. Admin deletes any Machines which launched from the stock boot
   image.

So I don't like the present-tense "overriding" in wording about step
(4).
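
As an illustration only, step (4) of the timeline above could be sketched as a simple filter. The `Machine` class, the `machines_to_delete` helper, and the image IDs are hypothetical stand-ins, not MCO or Machine API types. The sketch assumes the admin has already recorded the preferred boot image ID for each affected MachineSet.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Machine:
    """Hypothetical, simplified view of a Machine: name, owning MachineSet, boot image ID."""
    name: str
    machine_set: str
    boot_image: str


def machines_to_delete(machines, preferred_images):
    """Return names of Machines that launched from an undesired boot image.

    preferred_images maps a MachineSet name to the admin-preferred boot
    image ID. Machines whose MachineSet has no recorded preference are
    left alone, because there is nothing to compare against.
    """
    return [
        m.name
        for m in machines
        if m.machine_set in preferred_images
        and m.boot_image != preferred_images[m.machine_set]
    ]


if __name__ == "__main__":
    # Made-up example data: one MachineSet with a custom image preference.
    machines = [
        Machine("worker-a-1", "worker-a", "ami-custom"),
        Machine("worker-a-2", "worker-a", "ami-stock"),
        Machine("worker-b-1", "worker-b", "ami-stock"),
    ]
    print(machines_to_delete(machines, {"worker-a": "ami-custom"}))
```

The comparison is against the admin-preferred value rather than against "whatever the MachineSet says now", which is why step (3) has to come before step (4).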

It also feels weird to me to have:

> ...you can disable the boot image management feature...

as a non-link, with a later:

> To disable boot image management, see
> ref:../machine_configuration/mco-update-boot-images.adoc#mco-update-boot-images-disable_machine-configs-configure[disable
> boot image management].

giving the link.  I'd have expected some of the folks reading the
initial words to think "huh, how do I do this step?", and having the
words they were wondering about be the link to the docs that would
clear them up seems more usable than requiring them to skim to the end
of the paragraph to find the link they need.  But maybe there are doc
conventions around this whose motivation I don't understand?

Anyhow, none of my concerns seem large enough to be worth delaying a
known-issue declaration over, so I'm adopting Darragh's wording in
this commit, pointing out the places I don't agree with in this commit
message, and leaving it up to the docs folks to decide if any of my
concerns are worth further word-smithing in future pull requests.

[1]: openshift#94987 (comment)
@wking wking force-pushed the boot-image-clobber-known-issue branch from 72d6586 to cf256d9 Compare June 24, 2025 22:56
@openshift-ci openshift-ci bot added the merge-review-needed Signifies that the merge review team needs to review this PR label Jun 24, 2025
@maxwelldb maxwelldb added the merge-review-in-progress Signifies that the merge review team is reviewing this PR label Jun 25, 2025
@maxwelldb maxwelldb added this to the Continuous Release milestone Jun 25, 2025
@maxwelldb maxwelldb self-requested a review June 25, 2025 14:30

* In {product-title} {product-version}, clusters using IPsec for network encryption might experience intermittent loss of pod-to-pod connectivity. This prevents some pods on certain nodes from reaching services on other nodes, resulting in connection timeouts. Internal testing could not reproduce this issue on clusters with 120 nodes or less. There is no workaround for this issue. (link:https://issues.redhat.com/browse/OCPBUGS-55453[OCPBUGS-55453])

* If you install a cluster on {aws-short} that has Amazon Machine Images (AMI) enabled or on {gcp-short} that has custom disk images enabled, the boot image management overrides these customization images with boot images. As a workaround, you can disable the boot image management feature, restore the boot images for the machine sets to their original location, and delete any machines that were incorrectly generated by the overriding boot images. To disable boot image management, see ref:../machine_configuration/mco-update-boot-images.adoc#mco-update-boot-images-disable_machine-configs-configure[disable boot image management]. (link:https://issues.redhat.com/browse/OCPBUGS-57796[OCPBUGS-57796])
Contributor

Suggested change
* If you install a cluster on {aws-short} that has Amazon Machine Images (AMI) enabled or on {gcp-short} that has custom disk images enabled, the boot image management overrides these customization images with boot images. As a workaround, you can disable the boot image management feature, restore the boot images for the machine sets to their original location, and delete any machines that were incorrectly generated by the overriding boot images. To disable boot image management, see ref:../machine_configuration/mco-update-boot-images.adoc#mco-update-boot-images-disable_machine-configs-configure[disable boot image management]. (link:https://issues.redhat.com/browse/OCPBUGS-57796[OCPBUGS-57796])
If you install a cluster on {aws-short} with Amazon Machine Images (AMI) enabled, or on {gcp-short} with custom disk images enabled, the boot image management feature overrides these custom images with default boot images.
+
As a workaround, you can disable the boot image management feature, restore the original boot images for the machine sets, and delete any machines that were incorrectly created by the overriding boot images.
+
To disable boot image management, see ref:../machine_configuration/mco-update-boot-images.adoc#mco-update-boot-images-disable_machine-configs-configure[Disable boot image management].
+
(link:https://issues.redhat.com/browse/OCPBUGS-57796[OCPBUGS-57796])

I'm thinking something like this? Be sure to remove doubled spaces and clarify the first clause, regardless.

Member Author

> I'm thinking something like this?

Fine with me; pushed with cf256d9 -> a35fd43. I also have "Allow edits and access to secrets by maintainers" checked, so folks can push rewords themselves, or close my pull and open their own. I really don't care what the words are here; it just seems like an issue that is worth a known-issue declaration. If there's a way I can delegate the known-issue declaration to someone who has more time to shepherd the change through to landing, please let me know.

Contributor

@maxwelldb maxwelldb left a comment

One suggestion left.

Because this impacts published release notes, it will need to go through the change management process. Be sure to indicate QE approval in the checkbox in the PR description, too. Thanks!

@maxwelldb maxwelldb removed merge-review-in-progress Signifies that the merge review team is reviewing this PR merge-review-needed Signifies that the merge review team needs to review this PR labels Jun 25, 2025
…r rewording

Copy-pasted from Max's comment [1].

If we spend too long talking about the wording, the fix will be out
before we've warned anyone about the bug.

[1]: openshift#94987 (comment)

openshift-ci bot commented Jun 28, 2025

@wking: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@gpei

gpei commented Jun 29, 2025

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jun 29, 2025
@openshift-ci-robot

@wking: This pull request references Jira Issue OCPBUGS-57348, which is invalid:

  • expected the bug to be in one of the following states: NEW, ASSIGNED, POST, but it is Verified instead
  • expected Jira Issue OCPBUGS-57348 to depend on a bug in one of the following states: VERIFIED, RELEASE PENDING, CLOSED (ERRATA), CLOSED (CURRENT RELEASE), CLOSED (DONE), CLOSED (DONE-ERRATA), but no dependents were found

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

@gpei

gpei commented Jul 3, 2025

@dfitzmau @maxwelldb Do you happen to know what is missing to merge this doc PR? The above information says that the status of the linked Jira issue is invalid, but it actually links to the Installer bug related to this PR, which was already verified and closed. Maybe we don't need a Jira issue to merge this at all, or there are other ways to manually merge the current PR. Thanks.

@dfitzmau
Contributor

dfitzmau commented Jul 3, 2025

> @dfitzmau @maxwelldb Do you happen to know what is missing to merge this doc PR? The above information says that the status of the linked Jira issue is invalid, but it actually links to the Installer bug related to this PR, which was already verified and closed. Maybe we don't need a Jira issue to merge this at all, or there are other ways to manually merge the current PR. Thanks.

Hi @gpei. There is a bug on the GitHub side, but it can be worked around by replacing "-" with "#" in the title of the PR, which I have done. A change management notification is still needed, but docs folks are better versed in that process, so I'll send out a notification to a wider group of OCP members. If we don't get a response within 24-48 hours, we can move to getting the PR merged.

@dfitzmau dfitzmau changed the title OCPBUGS-57348: release_notes/ocp-4-19-release-notes: Add a boot-image-clobber known issue OCPBUGS#57348: release_notes/ocp-4-19-release-notes: Add a boot-image-clobber known issue Jul 3, 2025
@dfitzmau dfitzmau removed the jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. label Jul 3, 2025
@openshift-ci-robot openshift-ci-robot removed jira/severity-moderate Referenced Jira bug's severity is moderate for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. labels Jul 3, 2025
@openshift-ci-robot

@wking: No Jira issue is referenced in the title of this pull request.
To reference a jira issue, add 'XYZ-NNN:' to the title of this pull request and request another refresh with /jira refresh.

@dfitzmau
Contributor

dfitzmau commented Jul 3, 2025

Hi @wking and @gpei. I sent out a change management notification. We should give it a day or so to address any responses, and then we can get this PR merged. Thanks for your contribution!

@jianli-wei

@wking @dfitzmau 4.19.2 should have the fix already, please see my comment, thanks!

cc @gpei

@gpei

gpei commented Jul 3, 2025

@dfitzmau thanks for the quick help on this!

@dfitzmau
Contributor

dfitzmau commented Jul 3, 2025

> Hi @wking and @gpei. I sent out a change management notification. We should give it a day or so to address any responses, and then we can get this PR merged. Thanks for your contribution!

Thanks, Jianli. I see that the 4.19.2 z-stream Images advisory lists the bug fix:

[Image: Screenshot From 2025-07-03 12-10-26]

So the known issue will still exist for 4.19.0 and 4.19.1?

@gpei

gpei commented Jul 3, 2025

@dfitzmau

> So the known issue will still exist for 4.19.0 and 4.19.1?

yes

@wking
Member Author

wking commented Jul 7, 2025

4.19.2 shipped last Tuesday, and RHSA-2025:9750 and the 4.19.2 release notes both mention OCPBUGS-57796. As far as I can tell, nobody is telling folks who installed with 4.19.0 or 4.19.1 with custom AWS or GCP boot images that they likely got bit, but 🤷 , I have no idea what magic is needed to get this pull merged, so I'm giving up and closing it. Good luck to anyone else who feels like customers might want to hear about the exposure/recovery process for those early 4.19s.

@wking wking closed this Jul 7, 2025
@wking wking deleted the boot-image-clobber-known-issue branch July 7, 2025 19:43
@dfitzmau
Contributor

dfitzmau commented Jul 8, 2025

> 4.19.2 shipped last Tuesday, and RHSA-2025:9750 and the 4.19.2 release notes both mention OCPBUGS-57796. As far as I can tell, nobody is telling folks who installed with 4.19.0 or 4.19.1 with custom AWS or GCP boot images that they likely got bit, but 🤷 , I have no idea what magic is needed to get this pull merged, so I'm giving up and closing it. Good luck to anyone else who feels like customers might want to hear about the exposure/recovery process for those early 4.19s.

Hi @wking. We could still document the known issue if it remains in scope for 4.19.0 and 4.19.1.

Labels
branch/enterprise-4.19 lgtm Indicates that a PR is ready to be merged. peer-review-done Signifies that the peer review team has reviewed this PR size/XS Denotes a PR that changes 0-9 lines, ignoring generated files.
7 participants